Project: US NICS Gun Registration and Census Data Analysis

Introduction

In this project, I analyse a gun registration dataset and a census dataset. The datasets originally come from Census and NICS. NICS (National Instant Criminal Background Check System) is used to run background checks on prospective buyers of firearms and/or explosives. The NICS dataset is an Excel file downloaded from nics.xlsx. The census dataset provides information about the people in all states of the United States, with most of the data from the year 2016, and can be downloaded as a CSV file from US census data.csv.

Using these datasets, I explore the trend of gun registration over the years for the states of the United States. Further analysis examines how census data such as population and income correlate with gun registration.

Data Wrangling

The first step is to inspect both datasets: the variables and features, the output of info(), the first few rows using head(), and any missing or duplicated data.
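
The inspection steps above can be sketched as follows. This is a minimal illustration on a toy table; the column names follow the census layout described below, but the values are made up and are not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the census table (hypothetical values, not the real data).
census = pd.DataFrame(
    {
        "Fact": ["fact a", "fact b", np.nan, np.nan],
        "Fact Note": [np.nan, np.nan, np.nan, np.nan],
        "Alabama": ["1,000", "2,000", np.nan, np.nan],
    }
)

print(census.head())   # first few rows
census.info()          # column dtypes and non-null counts

n_rows, n_cols = census.shape
n_duplicated = int(census.duplicated().sum())  # rows repeating an earlier row
n_missing = census.isnull().sum()              # missing values per column
print(n_rows, n_cols, n_duplicated)
print(n_missing)
```

Note that duplicated() treats two all-NaN rows as duplicates of each other, which is why empty rows show up in the duplicate count.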

Looking into the census data

General Properties

Based on the initial observation, the census data has 85 rows and 52 columns. Columns 1 and 2 are Fact and Fact Note, whereas columns 3 to 52 are the states of the United States. The data has 3 duplicated rows and around 20 missing values.

Several important points to note:

  1. The Fact column shows total population estimates for the years 2010 and 2016 for all the states.
  2. A few rows show the percentage of the population aged under 5 years, under 18 years and 65+ years for 2010 and 2016.
  3. Other rows show the percentage of the population by race for 2016.
  4. Fact rows also show the percentage of the population with a high school education and with a bachelor's degree or higher.
  5. Other information covers employment, housing units, annual payroll, median household income, per capita income, owner-occupied housing units and civilian labor force for 2011-2015.
  6. Men-, women- and veteran-owned firms, and retail sales, for the year 2012.
  7. Other rows are population per square mile and land area in square miles for the year 2010, and the FIPS Code.
  8. The Fact Note column gives notes about the Fact column.

More detailed information can be found on the Census website.

Looking into the guns registration dataset

General Properties

Based on the initial observation, the guns registration data has 12485 rows and 27 columns. Each column, such as permit and permit recheck, represents a type of transaction submitted to the National Instant Criminal Background Check System (NICS). The NICS firearm background checks are reported for every month and year for each of the states.

The transactions are divided by type of firearm: handgun, long gun and other (frames, receivers and other firearms that are neither handguns nor long guns). Multiple denotes a background check covering more than one type of firearm, while admin denotes administrative checks.

Other transactions corresponding to handgun, long gun and other are:

  1. Pre-Pawn
  2. Redemption
  3. Returned/Disposition
  4. Rentals
  5. Private Sale
  6. Return to Seller - Private Sale
  7. totals: the combination of all transactions mentioned above.

Note: All these are background checks requested by an officially licensed Federal Firearms Licensee (FFL) or a criminal law enforcement agent prior to the issuance of a firearm-related permit or transfer. More detailed information can be found on the website NICS_details.

Data Cleaning

(a) Census data

Now that we have looked into the dataset, let us look closely at the duplicated rows.

As we can see, all the duplicated rows are filled with NaN, so we will drop them and then check the data to confirm that only the three duplicated rows were removed.

The next thing to point out about the dataset is the type of each column. All the datatypes are object, because the state columns contain string characters. Except for the Fact and Fact Note columns, which should be of object type, all columns should be numeric. Therefore, characters such as $ and the comma, as well as quotes, need to be removed before converting to a numeric datatype.

Also, looking into the "Fact Note" column, most of its rows are NaN and only a few carry a note. Since only the Fact column is necessary to define the data, Fact Note is removed.

The next step is to look for missing data.

Looking into the missing values, there are in total 17 rows in which the state columns are filled with null values, so these rows are removed.

After removing the missing and duplicated values, 65 rows and 51 columns are left.
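
The three cleaning steps above (dropping duplicated rows, dropping Fact Note, dropping rows with nulls) can be sketched as below. The table is a small hypothetical stand-in, so the resulting shape differs from the 65 x 51 reported for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the census table: two real rows, three empty rows.
census = pd.DataFrame(
    {
        "Fact": ["fact a", "fact b", np.nan, np.nan, np.nan],
        "Fact Note": ["(a)", np.nan, np.nan, np.nan, np.nan],
        "Alabama": ["10", "20", np.nan, np.nan, np.nan],
        "Alaska": ["1", "2", np.nan, np.nan, np.nan],
    }
)

# 1. Drop duplicated rows (the first all-NaN row is kept at this stage).
deduped = census.drop_duplicates()

# 2. Drop the mostly empty Fact Note column.
no_note = deduped.drop(columns=["Fact Note"])

# 3. Drop the remaining rows that contain null values.
cleaned = no_note.dropna(how="any").reset_index(drop=True)

print(census.shape, deduped.shape, cleaned.shape)
```

Dropping Fact Note before calling dropna matters: otherwise every row with an empty note would be removed as well.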

The next step is removing the unnecessary characters and converting all columns except the Fact column (which is a description) to a numeric type. The other columns hold population figures and percentages, which must be in numeric form.
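
A minimal sketch of this conversion, assuming the offending characters are $, the comma and % (the exact character set in the real file is an assumption, and the values below are made up):

```python
import pandas as pd

# Hypothetical rows in the same string format as the census table.
census = pd.DataFrame(
    {
        "Fact": ["population", "median household income", "under 5 years, percent"],
        "Alabama": ["1,234,567", "$45,000", "6.5%"],
        "Alaska": ["234,567", "$70,000", "7.5%"],
    }
)

state_cols = census.columns.drop("Fact")
for col in state_cols:
    # Remove currency signs, thousands separators and percent signs.
    stripped = census[col].str.replace(r"[$,%]", "", regex=True)
    # to_numeric raises ValueError on any string it still cannot parse.
    census[col] = pd.to_numeric(stripped)

print(census.dtypes)
```

Leaving errors at its default of "raise" is deliberate here: the ValueError points at the first position that still needs attention.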

The attempt to change the datatype to numeric above raises an error: "ValueError: Unable to parse string ""01"" at position 64". Let us look closely at row 64 to see what it contains.

Note: to keep the cell from raising the error, I commented the conversion out.

The FIPS code in row 64 is wrapped in quotes, which is why the data could not be converted to numeric form. Because this analysis uses the population data and not the code, this row is removed from the dataset.

Let us take a closer look at the "Fact" column and see which rows are required for the analysis.

One of the questions explored with this dataset is the population of different states in 2010 and 2016 and its correlation with the NICS dataset.

The other question is the impact of income (median household, per capita) on gun registration. Hence, I will select the rows that hold those data.

The second attempt to change the datatype to numeric raises another error: "ValueError: Unable to parse string "Z" at position 10". Let us look closely at row 10 to see what the data looks like. A further error, "ValueError: Unable to parse string "D" at position 42", was also fixed.

Note: to keep the cell from raising the error above, I commented the conversion out.

Row 10 contains the string Z for those states whose population percentage is unknown. I will change it to 0, as we do not have data for the percentage of the Native Hawaiian race for four states: Maine, Michigan, Vermont and West Virginia.
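
The fix above can be sketched as follows: first locate the cells that cannot be parsed, then replace the Z flag with 0 before converting. The table and its values are hypothetical; only the Z flag and the replacement by 0 come from the analysis.

```python
import pandas as pd

# Hypothetical two-row excerpt; "Z" marks an unknown percentage.
census = pd.DataFrame(
    {
        "Fact": ["row a, percent", "row b, percent"],
        "Maine": ["1.5", "Z"],
        "Texas": ["2.5", "0.1"],
    }
)
state_cols = census.columns.drop("Fact")

# Cells that are present but become NaN under errors="coerce" are unparseable.
coerced = census[state_cols].apply(pd.to_numeric, errors="coerce")
unparsed = coerced.isnull() & census[state_cols].notnull()
print(census.loc[unparsed.any(axis=1), "Fact"])

# Replace the flag with "0", then convert; this raises if anything else is left.
numeric = census[state_cols].replace("Z", "0").apply(pd.to_numeric)
cleaned = pd.concat([census[["Fact"]], numeric], axis=1)
print(cleaned)
```

Replacing an unknown value with 0 is a simplification; keeping it as NaN through errors="coerce" would be an alternative if those rows were used in a calculation.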

Now that all the rows have been converted to numeric without further errors, the next step is to check the datatypes in the census dataframe to make sure everything looks right.

The datatypes confirm that the data is clean and ready for further analysis. With the census columns cleaned and the relevant Fact rows selected, the next step is to look into the details of the gun registration dataset.

(b) NICS dataset

The data looks clean: month and state are strings, and the other columns are integers and floats.

The month column holds both the year and the month, so it is converted to a datetime object.
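
A minimal sketch of the conversion, assuming the month strings look like "2017-09" (the exact format in the file is an assumption, and the rows are made up):

```python
import pandas as pd

# Hypothetical rows; the "YYYY-MM" format is assumed.
guns = pd.DataFrame(
    {
        "month": ["2017-09", "2017-08", "1998-11"],
        "state": ["Alabama", "Alabama", "Alabama"],
        "totals": [300, 200, 100],
    }
)

# Parse the strings; each value becomes the first day of that month.
guns["month"] = pd.to_datetime(guns["month"], format="%Y-%m")

# A datetime column makes it easy to derive the year for later grouping.
guns["year"] = guns["month"].dt.year
print(guns.dtypes)
```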

From the info, we find that there is missing data in most of the columns except month, state and totals. Let us check whether the "totals" column is related to the other columns.

The totals column represents the total transactions in the guns registration dataframe. Let us check this further.

This check confirms that the totals column is the sum of all the individual transaction columns for each state. Now that the data is clean and has the correct datatypes, let us begin exploring it.
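
One way to run this check is to sum the transaction columns row by row and compare with totals. The sketch below uses a made-up table with only three of the transaction columns.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt with a missing value in one transaction column.
guns = pd.DataFrame(
    {
        "month": ["2017-09", "2017-08"],
        "state": ["Alabama", "Alabama"],
        "permit": [10.0, np.nan],
        "handgun": [5.0, 7.0],
        "long_gun": [3.0, 2.0],
        "totals": [18, 9],
    }
)

# Every column except the identifiers and totals is a transaction count.
component_cols = guns.columns.drop(["month", "state", "totals"])

# sum(axis=1) skips NaN by default, so missing counts are treated as 0.
row_sums = guns[component_cols].sum(axis=1)
matches = row_sums.eq(guns["totals"])
print(matches.all())
```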

Exploratory Data Analysis

Research Question 1: What is the overall trend of gun registration and which states show highest growth?

To answer this question, we will first look at gun registration in all the states and then narrow the analysis down to the states with higher registration.

Because totals is the sum of all the transactions and has no missing data, I will create a dataframe with the month, state and totals columns for further analysis. Then I will look at the distribution across the states of the US.

We find that the state column has 54 distinct values and that there is a gun registration entry for each month and year (1998-2017), giving 12485 rows. The census dataset has 50 states. Let us first match the states between the census and guns_dist dataframes and select the rows whose state appears in both.

Keeping only the states that match the census dataset narrows the data down from 12485 rows to 11350 rows.
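
The matching step can be sketched with isin. The guns_dist name follows the text; census_states and the rows below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical list of census state columns (every column except Fact).
census_states = ["Alabama", "Alaska"]

# Hypothetical guns_dist rows, including entries that are not census states.
guns_dist = pd.DataFrame(
    {
        "state": ["Alabama", "Guam", "Alaska", "Puerto Rico"],
        "totals": [100, 5, 50, 8],
    }
)

# Keep only the rows whose state also appears in the census table.
in_census = guns_dist["state"].isin(census_states)
guns_states = guns_dist[in_census].reset_index(drop=True)

# Listing what was dropped is a quick sanity check on the match.
unmatched = sorted(set(guns_dist["state"]) - set(census_states))
print(len(guns_dist), len(guns_states), unmatched)
```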

Next, I look at the distribution of the total gun registration values and their counts.

In the histogram, most of the data lies below 100K. Let us look at the statistics.

The national mean is 23K, while the minimum is 6 and the maximum is 541K, so the data varies widely. Let us look at the box plot next.

Based on this visualisation, Kentucky stands out with increased gun registration. From the box plot, I narrowed the analysis down to 10 states, namely California, Florida, Illinois, Indiana, Kentucky, North Carolina, Texas, Pennsylvania, Ohio and Utah, to see the distribution over the years.

Let us look closely at the trend in these states from 1998 to 2017.
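
One way to prepare the data for such a trend plot is to aggregate totals per year and state. The sketch below assumes yearly sums (the plot may equally use monthly values) and uses made-up numbers for two of the states.

```python
import pandas as pd

# Hypothetical monthly totals for two of the selected states.
guns_states = pd.DataFrame(
    {
        "month": pd.to_datetime(
            ["2016-01-01", "2016-02-01", "2017-01-01", "2016-01-01", "2017-01-01"]
        ),
        "state": ["Kentucky", "Kentucky", "Kentucky", "Texas", "Texas"],
        "totals": [100, 200, 150, 80, 90],
    }
)

selected = ["Kentucky", "Texas"]
subset = guns_states[guns_states["state"].isin(selected)]

# One row per year, one column per state, summed over the months.
yearly = subset.assign(year=subset["month"].dt.year).pivot_table(
    index="year", columns="state", values="totals", aggfunc="sum"
)
print(yearly)
```

A table in this shape can be passed directly to a line plot, with one line per state.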

Looking at the plot, the following points stand out:

  1. Kentucky shows an increase in gun registration over the years from 1998 to 2017, apart from a dip in 2017.

  2. North Carolina shows a sharp increase to over 500K in 2014, while the numbers are lower in the other years.

  3. California shows an increase mainly after 2012.

  4. Texas shows some increase in 2013 and the following years, while Utah shows an increase for the years 2010-2012. The remaining states show only a slight increase in gun registration over the years.

Having looked at the plot, restricting it to five states (Kentucky, North Carolina, California, Texas and Utah) gives a clearer picture of the trend over the years.

  1. The plot shows that Utah and North Carolina have lower registration, except for the years 2010-12 for Utah and 2014 for North Carolina.

  2. Kentucky, California and Texas show an increase in gun registration, and therefore we will focus on these three states in the subsequent analysis.

Research Question 2: What census data is most associated with high gun registration per capita?

Having found the states with the highest growth in gun registration, let us look at the population of these states and see whether population has any impact on gun registration.

For that, I am choosing three states, California, Kentucky and Texas, which show an increase in gun registration.

Only the first two rows hold the population for the years 2010 and 2016, so let us select those rows for the three states: Kentucky, California and Texas.

Looking at the data, we want to turn the state columns into rows and, for the Fact column, extract the year and the population and split them into two separate columns.
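
This reshaping can be sketched with melt and a regular expression for the year. The Fact labels and population values below are simplified, hypothetical stand-ins for the real rows.

```python
import pandas as pd

# Hypothetical population rows in wide form: one column per state.
census_pop = pd.DataFrame(
    {
        "Fact": ["Population estimates, July 1, 2016", "Population, April 1, 2010"],
        "Kentucky": [1100, 1000],
        "Texas": [2300, 2000],
    }
)

# Turn the state columns into rows: one row per (Fact, state) pair.
census_long = census_pop.melt(id_vars="Fact", var_name="state", value_name="population")

# Pull the four-digit year out of the Fact label into its own column.
census_long["year"] = census_long["Fact"].str.extract(r"(\d{4})", expand=False).astype(int)
census_long = census_long.drop(columns="Fact")
print(census_long)
```

The pattern takes the first four-digit number in the label, so it relies on the label not containing another four-digit number before the year.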

We now have a dataframe for the years 2010 and 2016 with population and gun registration. Let us look at how gun registration varies with population for these states.
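
A dataframe of this kind can be built by merging the monthly registrations with the yearly population on state and year. The sketch below uses made-up values and hypothetical frame names.

```python
import pandas as pd

# Hypothetical yearly population in long form.
population = pd.DataFrame(
    {
        "state": ["Kentucky", "Kentucky"],
        "year": [2010, 2016],
        "population": [1000, 1100],
    }
)

# Hypothetical monthly registrations, including a year without census data.
guns = pd.DataFrame(
    {
        "month": pd.to_datetime(["2010-01-01", "2010-02-01", "2016-01-01", "2013-01-01"]),
        "state": ["Kentucky"] * 4,
        "totals": [10, 20, 30, 40],
    }
)
guns["year"] = guns["month"].dt.year

# The inner join keeps only months from years with a population value;
# the single yearly population is repeated on every matching month.
merged = guns.merge(population, on=["state", "year"], how="inner")
print(merged)
```

Because the population value is repeated for each month, any plot of registrations against population shows a vertical strip of monthly points per year.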

Summary of the pairplot graphs for California, Kentucky and Texas for the years 2010 and 2016

  1. The census data gives one population value per year, while gun registration data is available for all 12 months of 2010 and 2016. Each population value is therefore plotted against 12 monthly registration points.

  2. The plots clearly show that total gun registration is higher in 2016 than in 2010 for all these states.

  3. For Kentucky, gun registration increased from 2010 to 2016 although its population is much smaller than that of California and Texas.

  4. For California and Texas, both population and gun registration increased. Although this figure gives some insight into how gun registration relates to population, more census data is required to draw firmer conclusions about the correlation.

Looking at the statistics for the years 2010 and 2016:

Both gun registration and population increased from 2010 to 2016. However, California and Kentucky each show an increase of more than 100K registrations, although the population of Kentucky increased by 97K while the population of California increased by approx 200K.

Role of income level with the gun registration

For California, for example, the census dataset gives only one median household income value per state ($61818), so the same income is paired with every total gun registration value for the years 2011-2015.

The statistics show that per capita income is lowest for Kentucky, yet its gun registration is the highest. California has a high per capita income, but its gun registration is the lowest of the three states.

Similarly, median household income is lowest for Kentucky, yet its gun registration is the highest. California has a high median income, but its gun registration is the lowest of the three states.

Both jointplots, for median household income and per capita income (in USD), show that Kentucky has higher gun registration although its income is the lowest.

Conclusions

Results

  1. Exploratory data analysis of these datasets highlights the trend in gun registration over the years for all the states of the US.

  2. Kentucky stands out with higher gun registration over the years. Further analysis was done using the population and income data.

  3. The data also shows that North Carolina has low gun registration except for the year 2014, which shows more than 500K registrations.

  4. Other states also show rising gun registration over the years, although their totals stay below 100K.

Limitations

  1. The descriptive statistics of population and income against gun registration for different states provide some insights, but the correlation is weak. Further hypothesis testing is needed to establish whether there is a significant correlation between these variables.

  2. The individual transaction columns for gun registration have a lot of missing data. More detailed insights could be derived by filling those null values with suitable statistical methods and testing the correlation with hypothesis tests.

  3. There is only one data point for each Fact row of the census per state. More census data would give a clearer picture of these correlations.
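
As a starting point for the further testing mentioned above, the strength of a linear relationship can be quantified with Pearson's correlation coefficient. The sketch below uses made-up numbers for three hypothetical states, so the result says nothing about the real data.

```python
import pandas as pd

# Hypothetical per-state summary: income against total registrations.
summary = pd.DataFrame(
    {
        "state": ["A", "B", "C"],
        "income": [1.0, 2.0, 3.0],
        "registrations": [30.0, 20.0, 10.0],
    }
)

# Pearson correlation coefficient between the two numeric columns.
r = summary["income"].corr(summary["registrations"], method="pearson")
print(round(r, 3))
```

With only three states and one census value per state, a coefficient like this is far too unstable to support a conclusion, which is the limitation described in point 3.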